I Claim:

1. A method of processing video comprising:

a. configuring a plurality of processing elements into a two-dimensional array of 
processing elements such that each processing element includes a plurality of 
vector registers, a plurality of block registers, a plurality of scalar registers, and a 
plurality of arithmetic logic units, wherein a data path of each processing element 
includes a set of processing element slices each coupled to one arithmetic logic 
unit such that each arithmetic logic unit receives a specified portion of each vector 
register as input; 

b. configuring a video stream into data blocks; 

c. loading the data blocks into the plurality of vector registers within each 
processing element; 

d. reading the specified portions of each vector register by each of the corresponding 
arithmetic logic units within all processing elements simultaneously; and 

e. processing the read portions by the arithmetic logic units such that the data blocks 
from the plurality of vector registers are processed in parallel. 
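
As an informal illustration of steps a through e (not part of the claim language), the following Python sketch models one possible reading of the method. The names ProcessingElement, NUM_ALUS, LANE_WIDTH, make_array and split_into_blocks, the register sizes, and the choice of addition as the arithmetic operation are all assumptions made for the example; the loop over arithmetic logic units runs sequentially here, whereas the claim describes simultaneous operation in hardware.

```python
from dataclasses import dataclass, field

NUM_ALUS = 4     # assumed number of ALUs (and slices) per processing element
LANE_WIDTH = 2   # assumed number of values in each ALU's portion of a register


@dataclass
class ProcessingElement:
    # Two vector registers, each NUM_ALUS * LANE_WIDTH values wide (assumed sizes).
    vector_regs: list = field(
        default_factory=lambda: [[0] * (NUM_ALUS * LANE_WIDTH) for _ in range(2)]
    )

    def load(self, reg_index, block):
        """Step c: load one data block into a vector register."""
        assert len(block) == NUM_ALUS * LANE_WIDTH
        self.vector_regs[reg_index] = list(block)

    def portion(self, reg_index, alu):
        """Step d: the fixed portion of a vector register that ALU `alu` reads."""
        start = alu * LANE_WIDTH
        return self.vector_regs[reg_index][start:start + LANE_WIDTH]

    def process(self):
        """Step e: each ALU combines its portions of both registers (here: add)."""
        out = []
        for alu in range(NUM_ALUS):  # sequential here; concurrent in hardware
            a = self.portion(0, alu)
            b = self.portion(1, alu)
            out.extend(x + y for x, y in zip(a, b))
        return out


def make_array(rows, cols):
    """Step a: arrange processing elements in a two-dimensional array."""
    return [[ProcessingElement() for _ in range(cols)] for _ in range(rows)]


def split_into_blocks(stream, block_len):
    """Step b: cut a flat sample stream into fixed-size data blocks."""
    return [stream[i:i + block_len] for i in range(0, len(stream), block_len)]
```

In this sketch, each arithmetic logic unit always reads the same index range of every vector register, which is one way to interpret "a specified portion of each vector register".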

2. The method of claim 1 wherein loading the data blocks into the plurality of vector 
registers, reading the specified portions of each vector register, processing the read 
portions, and writing the results of the processing is performed within one processing 
element clock cycle. 

3. The method of claim 1 wherein each arithmetic logic unit is further configured to receive 
the contents of one of the scalar registers as input. 

4. The method of claim 3 wherein when the contents of one of the scalar registers is used as 
input, a scalar register value is broadcast from the one scalar register to all arithmetic 
logic units within a given processing element. 
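
The scalar broadcast of claims 3 and 4 can be pictured with a short Python sketch (illustrative only, not part of the claim language). The function name broadcast_scalar_op, the even split of the vector register among the arithmetic logic units, and the use of multiplication as the operation are assumptions.

```python
def broadcast_scalar_op(vector_reg, scalar, num_alus):
    """Give every ALU the same scalar value alongside its own register portion.

    Returns one result list per ALU. Multiplication is an assumed operation.
    """
    lane = len(vector_reg) // num_alus  # assumed: register divides evenly
    results = []
    for alu in range(num_alus):
        portion = vector_reg[alu * lane:(alu + 1) * lane]
        # The single scalar register value is shared by all ALUs in the element.
        results.append([scalar * v for v in portion])
    return results
```

Under these assumptions, broadcast_scalar_op([1, 2, 3, 4], 3, 2) hands the value 3 to both arithmetic logic units, each of which applies it to its own two-value portion.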

5. The method of claim 1 wherein the plurality of processing elements are further
configured such that all processing elements within a same column of the array form a 
processing slice, and each processing element further comprises a local accumulation 
register to accumulate the results of the plurality of arithmetic logic units within the 
processing element, further wherein each processing slice is coupled to a slice 
accumulator such that each slice accumulator buffers the results from any one of the local 
accumulation registers corresponding to the processing elements of the processing slice. 

6. The method of claim 5 further comprising accumulating the results of each slice 
accumulator into a global accumulator register. 
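
The three accumulation levels of claims 5 and 6 can be sketched in Python as follows (illustrative only, not part of the claim language). Summation as the accumulate operation, the function names, and the selected_rows parameter that picks which processing element each slice accumulator buffers are assumptions.

```python
def local_accumulate(alu_results):
    """Local accumulation register: combine the ALU results of one element."""
    return sum(alu_results)  # summation is an assumed accumulate operation


def slice_accumulate(column_locals, selected_row):
    """Slice accumulator: buffer the local value of any one element in a column."""
    return column_locals[selected_row]


def global_accumulate(slice_values):
    """Global accumulator register: combine the slice accumulator values."""
    return sum(slice_values)


def reduce_array(alu_results, selected_rows):
    """alu_results[row][col] holds the ALU outputs of one processing element."""
    rows, cols = len(alu_results), len(alu_results[0])
    local_vals = [
        [local_accumulate(alu_results[r][c]) for c in range(cols)]
        for r in range(rows)
    ]
    slice_vals = [
        slice_accumulate([local_vals[r][c] for r in range(rows)], selected_rows[c])
        for c in range(cols)
    ]
    return global_accumulate(slice_vals)
```

Here each column of the array plays the role of a processing slice, matching the column grouping recited in claim 5.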

7. The method of claim 1 further comprising loading the data blocks into the plurality of
block registers prior to loading the data blocks into the plurality of vector registers, such 
that the data blocks are loaded into the plurality of vector registers from the plurality of 
block registers. 

8. The method of claim 1 further comprising writing the results of the processing performed 
by the arithmetic logic units to the plurality of vector registers. 

9. An apparatus to process video, the apparatus comprising: 

a. a main memory; and 

b. a two-dimensional array of processing elements, wherein each processing element 
includes a plurality of vector registers, a plurality of block registers, a plurality of 
scalar registers, and a plurality of arithmetic logic units, further wherein a data 
path of each processing element includes a set of processing element slices each 
coupled to one arithmetic logic unit such that each arithmetic logic unit receives a 
specified portion of each vector register as input; 

wherein video data is received by the main memory and configured as data blocks, the 
data blocks are loaded into the plurality of vector registers within each processing 
element, specified portions of each vector register are read by each of the corresponding 
arithmetic logic units simultaneously within all processing elements, the read portions are 
processed by the arithmetic logic units such that the data blocks from the plurality of
vector registers are processed in parallel, and the results of the processing performed by 
the arithmetic logic units are written to the plurality of vector registers. 

10. The apparatus of claim 9 wherein each arithmetic logic unit is further configured to 
receive the contents of one of the scalar registers as input. 

11. The apparatus of claim 10 wherein when the contents of one of the scalar registers is used
as input, a scalar register value is broadcast from the one scalar register to all arithmetic 
logic units within a given processing element. 

12. The apparatus of claim 9 wherein the plurality of processing elements are further 
configured such that all processing elements within a same column of the array form a 
processing slice, and each processing element further comprises a local accumulation 
register to accumulate the results of the plurality of arithmetic logic units within the 
processing element, further wherein each processing slice is coupled to a slice 
accumulator such that each slice accumulator buffers the results from any one of the local 
accumulation registers corresponding to the processing elements of the processing slice. 

13. The apparatus of claim 12 further comprising a global accumulator unit to accumulate the 
results of each slice accumulator into a global accumulation register. 

14. The apparatus of claim 9 wherein the plurality of block registers receives the data blocks 
from the main memory and the plurality of vector registers receives the data blocks from 
the plurality of block registers. 

15. The apparatus of claim 9 wherein each processing element further comprises a mask 
register to exclude processing of the processing element. 
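
A minimal Python sketch of the mask register of claim 15 follows (illustrative only, not part of the claim language). The polarity of the mask (a value of 1 excludes the element), the function name run_masked, and the choice to leave an excluded element's data unchanged are assumptions.

```python
def run_masked(pe_values, mask, op):
    """Apply `op` in every processing element whose mask bit is clear.

    pe_values[row][col] is the value held by one processing element and
    mask[row][col] is its mask register. Assumed polarity: 1 means excluded.
    """
    return [
        [value if bit else op(value) for value, bit in zip(val_row, mask_row)]
        for val_row, mask_row in zip(pe_values, mask)
    ]
```

With these assumptions, run_masked([[1, 2], [3, 4]], [[0, 1], [0, 0]], lambda x: x * 2) doubles every value except the one in row 0, column 1, whose element is masked out.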

16. An apparatus to process video comprising: 

a. means for configuring a plurality of processing elements into a two-dimensional 
array of processing elements such that each processing element includes a 
plurality of vector registers, a plurality of block registers, a plurality of scalar 
registers, and a plurality of arithmetic logic units, wherein a data path of each
processing element includes a set of processing element slices each coupled to one 
arithmetic logic unit such that each arithmetic logic unit receives a specified 
portion of each vector register as input; 

b. means for configuring a video stream into data blocks coupled to the means for 
configuring a plurality of processing elements; 

c. means for loading the data blocks into the plurality of vector registers within each 
processing element, the means for loading coupled to the means for configuring 
the video stream; 

d. means for reading the specified portions of each vector register by each of the 
corresponding arithmetic logic units within all processing elements 
simultaneously, the means for reading coupled to the means for loading; and 

e. means for processing the read portions by the arithmetic logic units such that the 
data blocks from the plurality of vector registers are processed in parallel, the
means for processing coupled to the means for reading. 

17. The apparatus of claim 16 wherein the means for loading the data blocks into the plurality 
of vector registers, the means for reading the specified portions of each vector register, 
the means for processing the read portions, and the means for writing the results of the 
processing each operate within one processing element clock cycle. 

18. The apparatus of claim 16 wherein each arithmetic logic unit is further configured to 
receive the contents of one of the scalar registers as input. 

19. The apparatus of claim 18 wherein when the contents of one of the scalar registers is used 
as input, a scalar register value is broadcast from the one scalar register to all arithmetic 
logic units within a given processing element. 

20. The apparatus of claim 16 wherein the plurality of processing elements are further 
configured such that all processing elements within a same column of the array form a 
processing slice, and each processing element further comprises a local accumulation 
register to accumulate the results of the plurality of arithmetic logic units within the 
processing element, further wherein each processing slice is coupled to a slice 
accumulator such that each slice accumulator buffers the results from any one of the local
accumulation registers corresponding to the processing elements of the processing slice. 

21. The apparatus of claim 20 further comprising a global accumulator unit to accumulate the 
results of each slice accumulator into a global accumulation register. 

22. The apparatus of claim 16 further comprising means for loading the data blocks into the
plurality of block registers prior to loading the data blocks into the plurality of vector 
registers, such that the data blocks are loaded into the plurality of vector registers from 
the plurality of block registers, the means for loading data blocks into the plurality of 
block registers coupled to the means for loading the data blocks into a plurality of vector 
registers. 

23. The apparatus of claim 16 further comprising means for writing the results of the 
processing performed by the arithmetic logic units to the plurality of vector registers, the 
means for writing coupled to the means for processing. 

24. An apparatus to process video, the apparatus comprising a two-dimensional array of 
processing elements, wherein each processing element includes a plurality of vector 
registers, a plurality of block registers, a plurality of scalar registers, and a plurality of 
arithmetic logic units, further wherein a data path of each processing element includes a 
set of processing element slices each coupled to one arithmetic logic unit such that each 
arithmetic logic unit receives a specified portion of each vector register as input. 

25. The apparatus of claim 24 further comprising a main memory coupled to the two-dimensional
array of processing elements, wherein video data is received by the main
memory and configured as data blocks, the data blocks are loaded into the plurality of 
vector registers within each processing element, specified portions of each vector register 
are read by each of the corresponding arithmetic logic units simultaneously within all 
processing elements, the read portions are processed by the arithmetic logic units such 
that the data blocks from the plurality of vector registers are processed in parallel, and the 
results of the processing performed by the arithmetic logic units are written to the 
plurality of vector registers. 

PATENT 
SONY-27400 



26. The apparatus of claim 25 wherein each arithmetic logic unit is further configured to 
receive the contents of one of the scalar registers as input. 

27. The apparatus of claim 26 wherein when the contents of one of the scalar registers is used 
as input, a scalar register value is broadcast from the one scalar register to all arithmetic 
logic units within a given processing element. 

28. The apparatus of claim 25 wherein the plurality of processing elements are further 
configured such that all processing elements within a same column of the array form a 
processing slice, and each processing element further comprises a local accumulation 
register to accumulate the results of the plurality of arithmetic logic units within the 
processing element, further wherein each processing slice is coupled to a slice 
accumulator such that each slice accumulator buffers the results from any one of the local 
accumulation registers corresponding to the processing elements of the processing slice. 

29. The apparatus of claim 28 further comprising a global accumulator unit to accumulate the 
results of each slice accumulator into a global accumulation register. 

30. The apparatus of claim 25 wherein the plurality of block registers receives the data blocks 
from the main memory and the plurality of vector registers receives the data blocks from 
the plurality of block registers. 

31. The apparatus of claim 25 wherein each processing element further comprises a mask
register to exclude processing of the processing element. 